library(reshape2)
library(corrplot)
## corrplot 0.90 loaded
library(ggplot2)
library(C50)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(nnet)
library(NeuralNetTools)
library(rpart)
library(rpart.plot)
library(caret)
## Loading required package: lattice
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.2 ✓ purrr 0.3.4
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x randomForest::combine() masks dplyr::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## x randomForest::margin() masks ggplot2::margin()
options(scipen=999)
In this section, we will explore our data and develop an understanding of the information available to us. Our overall goal is to determine which records in the set may be prone to higher chances of mortality.
fetal_df <- read.csv(file = '/Users/bikramgill/Desktop/fetal_health.csv')
As we see below, there are no missing values in this dataset. Thus, no action will be taken in this regard.
## Count of missing values for each column.
sapply(fetal_df, function(x) sum(is.na(x)))
## baseline.value
## 0
## accelerations
## 0
## fetal_movement
## 0
## uterine_contractions
## 0
## light_decelerations
## 0
## severe_decelerations
## 0
## prolongued_decelerations
## 0
## abnormal_short_term_variability
## 0
## mean_value_of_short_term_variability
## 0
## percentage_of_time_with_abnormal_long_term_variability
## 0
## mean_value_of_long_term_variability
## 0
## histogram_width
## 0
## histogram_min
## 0
## histogram_max
## 0
## histogram_number_of_peaks
## 0
## histogram_number_of_zeroes
## 0
## histogram_mode
## 0
## histogram_mean
## 0
## histogram_median
## 0
## histogram_variance
## 0
## histogram_tendency
## 0
## fetal_health
## 0
Correlations will be used to reduce the feature set down initially to those that have more of a relation to fetal_health, our target variable.
Further EDA will be conducted on the remaining feature set.
We see from the first visual below that there are no features that have a strong correlation to fetal_health; with the highest correlation being prolonged_decelerations (0.48). Based on the correlations of this dataset; if a minimum correlation of abs(0.20) were used; there would be 10 major features of interest. These have been listed in order of absolute correlation below.
prolongued_decelerations - 0.485
abnormal_short_term_variability - 0.471
percentage_of_time_with_abnormal_long_term_variability - 0.426
accelerations - 0.364
histogram_mode - 0.250
histogram_mean - 0.227
mean_value_of_long_term_variability - 0.227
histogram_variance - 0.207
histogram_median - 0.205
uterine_contractions - 0.204
Of these 10 features, the second visual will be used to ensure that the features are not highly correlated to one another, so as to avoid weighting the model to a particular direction. If variables are found to be highly correlated to each other, the variable with the higher correlation to fetal_health will be retained and the other removed.
options(repr.plot.width = 25, repr.plot.height = 25)
fetal_health_corr <- cor(x = fetal_df$fetal_health,y = fetal_df[1:21])
corrplot::corrplot(fetal_health_corr, tl.cex=0.5, method = "number")
options(repr.plot.width = 25, repr.plot.height = 25)
fetal_corr <- cor(fetal_df)
corrplot::corrplot(fetal_corr, tl.cex=0.4)
The following function has been defined and used to remove outliers from columns columns based on the analyses from section Distributions and Outlier Analysis.
Outliers have been defined as following:
First Quartile = Q1 Third Quartile = Q3 Interquartile Range = IQR
Outliers are any points < (Q1 - (1.5 * IQR)) or points > (Q3 + (1.5 * IQR))
outliers <- function(x) {
Q1 <- quantile(x, probs=.25)
Q3 <- quantile(x, probs=.75)
iqr = Q3-Q1
upper_limit = Q3 + (iqr*1.5)
lower_limit = Q1 - (iqr*1.5)
x > upper_limit | x < lower_limit
}
remove_outliers <- function(df, cols = names(df)) {
for (col in cols) {
df <- df[!outliers(df[[col]]),]
}
df
}
fetal_df2 <- remove_outliers(fetal_df, c('accelerations', 'baseline.value'))
In this section, boxplots and histograms of the columns in this dataset will be explored for a visual representation of any outliers or interesting distributions within the data.
Following this, for columns where outliers are present, a decision will be made as to whether to remove them, keep them or normalize them.
This analysis may also be used to remove features that add too much noise to the model or do not contribute anything meaningful.
## Create train and test sets; to be used later for modelling
## set the seed to make your partition reproducible
set.seed(7)
sample_size = round(nrow(fetal_df)*.80)
index <- sample(seq_len(nrow(fetal_df)), size = sample_size)
fetal_train <- fetal_df[index, ]
fetal_test <- fetal_df[-index, ]
hist(fetal_df$baseline.value,
main="Histogram for baseline.value",
xlab="baseline.value",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), baseline.value)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("baseline.value boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$accelerations,
main="Histogram for accelerations",
xlab="accelerations",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), accelerations)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("accelerations boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$fetal_movement,
main="Histogram for fetal_movement",
xlab="fetal_movement",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), fetal_movement)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("fetal_movement boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$uterine_contractions,
main="Histogram for uterine_contractions",
xlab="uterine_contractions",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), uterine_contractions)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("uterine_contractions boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$light_decelerations,
main="Histogram for light_decelerations",
xlab="light_decelerations",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), light_decelerations)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("light_decelerations boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$severe_decelerations,
main="Histogram for severe_decelerations",
xlab="severe_decelerations",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), severe_decelerations)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("severe_decelerations boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$prolongued_decelerations,
main="Histogram for prolongued_decelerations",
xlab="prolongued_decelerations",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), prolongued_decelerations)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("prolongued_decelerations boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$abnormal_short_term_variability,
main="Histogram for abnormal_short_term_variability",
xlab="abnormal_short_term_variability",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), abnormal_short_term_variability)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("abnormal_short_term_variability boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$mean_value_of_short_term_variability,
main="Histogram for mean_value_of_short_term_variability",
xlab="mean_value_of_short_term_variability",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), mean_value_of_short_term_variability)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("mean_value_of_short_term_variability boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$percentage_of_time_with_abnormal_long_term_variability,
main="Histogram for percentage_of_time_with_abnormal_long_term_variability",
xlab="percentage_of_time_with_abnormal_long_term_variability",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), percentage_of_time_with_abnormal_long_term_variability)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("percentage_of_time_with_abnormal_long_term_variability boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$mean_value_of_long_term_variability,
main="Histogram for mean_value_of_long_term_variability",
xlab="mean_value_of_long_term_variability",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), mean_value_of_long_term_variability)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("mean_value_of_long_term_variability boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$histogram_width,
main="Histogram for histogram_width",
xlab="histogram_width",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), histogram_width)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("histogram_width boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$histogram_min,
main="Histogram for histogram_min",
xlab="histogram_min",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), histogram_min)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("histogram_min boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$histogram_max,
main="Histogram for histogram_max",
xlab="histogram_max",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), histogram_max)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("histogram_max boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$histogram_number_of_peaks,
main="Histogram for histogram_number_of_peaks",
xlab="histogram_number_of_peaks",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), histogram_number_of_peaks)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("histogram_number_of_peaks boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$histogram_number_of_zeroes,
main="Histogram for histogram_number_of_zeroes",
xlab="histogram_number_of_zeroes",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), histogram_number_of_zeroes)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("histogram_number_of_zeroes boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$histogram_mode,
main="Histogram for histogram_mode",
xlab="histogram_mode",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), histogram_mode)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("histogram_mode boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$histogram_mean,
main="Histogram for histogram_mean",
xlab="histogram_mean",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), histogram_mean)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("histogram_mean boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$histogram_median,
main="Histogram for histogram_median",
xlab="histogram_median",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), histogram_median)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("histogram_median boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$histogram_variance,
main="Histogram for histogram_variance",
xlab="histogram_variance",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), histogram_variance)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("histogram_variance boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$histogram_tendency,
main="Histogram for histogram_tendency",
xlab="histogram_tendency",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), histogram_tendency)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("histogram_tendency boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
hist(fetal_df$fetal_health,
main="Histogram for fetal_health",
xlab="fetal_health",
border="black",
col="wheat")
fetal_df %>%
ggplot( aes(as.numeric(row.names(fetal_df)), fetal_health)) +
geom_boxplot(color="tomato3", fill="wheat", alpha=0.2) +
geom_jitter(color="black", size=0.4, alpha=0.35) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("fetal_health boxplot") +
xlab("observation no.")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?